Pretrained large-scale vision-language models like CLIP have exhibited strong generalization over unseen tasks. Yet imperceptible adversarial perturbations can significantly reduce CLIP's performance on new tasks. In this work, we identify and explore the problem of \emph{adapting large-scale models for zero-shot adversarial robustness}. We first identify two key factors during model adaption -- training losses and adaptation methods -- that affect the model's zero-shot adversarial robustness. We then propose a text-guided contrastive adversarial training loss, which aligns the text embeddings and the adversarial visual features with contrastive learning on a small set of training data. We apply this training loss to two adaption methods, model finetuning and visual prompt tuning. We find that visual prompt tuning is more effective in the absence of texts, while finetuning wins in the existence of text guidance. Overall, our approach significantly improves the zero-shot adversarial robustness over CLIP, seeing an average improvement of over 31 points over ImageNet and 15 zero-shot datasets. We hope this work can shed light on understanding the zero-shot adversarial robustness of large-scale models.
translated by 谷歌翻译
Many visual recognition models are evaluated only on their classification accuracy, a metric for which they obtain strong performance. In this paper, we investigate whether computer vision models can also provide correct rationales for their predictions. We propose a ``doubly right'' object recognition benchmark, where the metric requires the model to simultaneously produce both the right labels as well as the right rationales. We find that state-of-the-art visual models, such as CLIP, often provide incorrect rationales for their categorical predictions. However, by transferring the rationales from language models into visual representations through a tailored dataset, we show that we can learn a ``why prompt,'' which adapts large visual representations to produce correct rationales. Visualizations and empirical experiments show that our prompts significantly improve performance on doubly right object recognition, in addition to zero-shot transfer to unseen tasks and datasets.
translated by 谷歌翻译
Deep networks for computer vision are not reliable when they encounter adversarial examples. In this paper, we introduce a framework that uses the dense intrinsic constraints in natural images to robustify inference. By introducing constraints at inference time, we can shift the burden of robustness from training to the inference algorithm, thereby allowing the model to adjust dynamically to each individual image's unique and potentially novel characteristics at inference time. Among different constraints, we find that equivariance-based constraints are most effective, because they allow dense constraints in the feature space without overly constraining the representation at a fine-grained level. Our theoretical results validate the importance of having such dense constraints at inference time. Our empirical experiments show that restoring feature equivariance at inference time defends against worst-case adversarial perturbations. The method obtains improved adversarial robustness on four datasets (ImageNet, Cityscapes, PASCAL VOC, and MS-COCO) on image recognition, semantic segmentation, and instance segmentation tasks. Project page is available at equi4robust.cs.columbia.edu.
translated by 谷歌翻译
Incidental supervision from language has become a popular approach for learning generic visual representations that can be prompted to perform many recognition tasks in computer vision. We conduct an in-depth exploration of the CLIP model and show that its visual representation is often strongly biased towards solving some tasks more than others. Moreover, which task the representation will be biased towards is unpredictable, with little consistency across images. To resolve this task bias, we show how to learn a visual prompt that guides the representation towards features relevant to their task of interest. Our results show that these visual prompts can be independent of the input image and still effectively provide a conditioning mechanism to steer visual representations towards the desired task.
translated by 谷歌翻译
Small differences in a person's motion can engage drastically different muscles. While most visual representations of human activity are trained from video, people learn from multimodal experiences, including from the proprioception of their own muscles. We present a new visual perception task and dataset to model muscle activation in human activities from monocular video. Our Muscles in Action (MIA) dataset consists of 2 hours of synchronized video and surface electromyography (sEMG) data of subjects performing various exercises. Using this dataset, we learn visual representations that are predictive of muscle activation from monocular video. We present several models, including a transformer model, and measure their ability to generalize to new exercises and subjects. Putting muscles into computer vision systems will enable richer models of virtual humans, with applications in sports, fitness, and AR/VR.
translated by 谷歌翻译
We introduce a framework for navigating through cluttered environments by connecting multiple cameras together while simultaneously preserving privacy. Occlusions and obstacles in large environments are often challenging situations for navigation agents because the environment is not fully observable from a single camera view. Given multiple camera views of an environment, our approach learns to produce a multiview scene representation that can only be used for navigation, provably preventing one party from inferring anything beyond the output task. On a new navigation dataset that we will publicly release, experiments show that private multiparty representations allow navigation through complex scenes and around obstacles while jointly preserving privacy. Our approach scales to an arbitrary number of camera viewpoints. We believe developing visual representations that preserve privacy is increasingly important for many applications such as navigation.
translated by 谷歌翻译
Vision-language models (VLMs) such as CLIP have shown promising performance on a variety of recognition tasks using the standard zero-shot classification procedure -- computing similarity between the query image and the embedded words for each category. By only using the category name, they neglect to make use of the rich context of additional information that language affords. The procedure gives no intermediate understanding of why a category is chosen, and furthermore provides no mechanism for adjusting the criteria used towards this decision. We present an alternative framework for classification with VLMs, which we call classification by description. We ask VLMs to check for descriptive features rather than broad categories: to find a tiger, look for its stripes; its claws; and more. By basing decisions on these descriptors, we can provide additional cues that encourage using the features we want to be used. In the process, we can get a clear idea of what features the model uses to construct its decision; it gains some level of inherent explainability. We query large language models (e.g., GPT-3) for these descriptors to obtain them in a scalable way. Extensive experiments show our framework has numerous advantages past interpretability. We show improvements in accuracy on ImageNet across distribution shifts; demonstrate the ability to adapt VLMs to recognize concepts unseen during training; and illustrate how descriptors can be edited to effectively mitigate bias compared to the baseline.
translated by 谷歌翻译
变异自动编码器(VAE)遭受后塌陷的苦难,其中用于建模和推理的强大神经网络在没有有意义使用潜在表示的情况下优化了目标。我们引入了推理评论家,通过需要潜在变量和观测值之间的对应关系来检测和激励后塌陷。通过将批评家的目标与自我监督的对比表示学习中的文献联系起来,我们从理论和经验上展示了优化推论批评家在观察和潜伏期之间增加相互信息,从而减轻后验崩溃。这种方法可以直接实施,并且需要比以前的方法要少得多的培训时间,但在三个已建立的数据集中获得了竞争结果。总体而言,该方法奠定了基础,以弥合先前与各种自动编码器的对比度学习和概率建模的框架,从而强调了两个社区在其交叉点上可能会发现的好处。
translated by 谷歌翻译
许多机器学习方法通过在推理时反转神经网络来运作,这已成为解决计算机视觉,机器人技术和图形中反向问题的流行技术。但是,这些方法通常涉及高度非凸损失景观的梯度下降,从而导致优化过程不稳定和缓慢。我们介绍了一种学习损失景观的方法,其中梯度下降是有效的,从而为反转过程带来了巨大的改进和加速。我们在许多生成和判别任务的方法上证明了这一优势,包括GAN倒置,对抗防御和3D人类姿势重建。
translated by 谷歌翻译
3D重建是计算机视觉中的一个基本问题,当重建对象被部分或完全遮住时,任务尤其具有挑战性。我们介绍了一种使用未观察到的对象施放的阴影的方法,以推断遮挡背后的3D卷。我们创建一个可区分的图像形成模型,使我们能够共同推断物体的3D形状,其姿势和光源的位置。由于该方法是端到端可区分的,因此我们能够集成对象几何学的学习先验,以生成不同对象类别的现实3D形状。实验和可视化表明该方法能够生成与阴影观察一致的多个可能的解决方案。即使光源和物体姿势的位置都未知,我们的方法也起作用。我们的方法对现实世界的图像也很强,而地面真实的阴影面罩未知。
translated by 谷歌翻译